Non-parametric resampling of random walks for spectral network clustering 
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Parametric resampling schemes have been recently employed in complex network analysis to 
assess the statistical significance of graph clustering and the robustness of community partitions. 
Here we propose a method to detect significant communities in complex networks based on the non- 
parametric resampling of the transition matrix associated with an unbiased random walk on the 
graph. We test this non-parametric approach on real- world and synthetic modular networks and we 
show that it substantially improves the existing algorithms based on spectral network decomposition. 
We finally discuss how bootstrap distributions can be used to assess the significance of different 
network properties without any a-priori assumption about the structure of the ensemble to which 
the network belongs. 
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In the past decade, network science has proven to be 
a robust and comprehensive framework to investigate, 
model and understand the structure and function of the 
complex interaction patterns observed in diverse biologi- 
cal, physical, social and technological systems HllSl. One 
of the central problems in the characterization of complex 
networks is the identification of communities, i.e. tightly- 
knit groups of nodes which exhibit poor connectivity with 
the rest of the graph [4j. In fact, experimental evidences 
confirmed that communities are the meso-scale building 
blocks of complex networks: they usually correspond to 
functional modules in the brain [SJ [6] , to topical clusters 
in social and communication networks [T, to metabolic 
reactions and functional domains in protein interaction 
networks [51 l^j , to disciplines and research areas in col- 
laboration networks HI. For this reason, many different 
community detection algorithms have been proposed |10j . 

A typical problem of networks analysis is that each 
network represents a single observation drawn from an 
unknown distribution of graphs having a certain set of 
characteristics |llj . Consequently, there is no predefined 
way to assess the statistical significance of any metric 
estimated on a given network. 

A widely approach to estimate the statistical signifi- 
cance of an observed network property takes into account 
randomized network ensembles, i.e. sets of graphs ob- 
tained from the original network by keeping fixed some 
structural properties (e.g. the degree sequence or the 
clustering coefficient) and rewiring the edges at ran- 
dom jT^HH]. In the case of community detection, this 
approach led to the definition of the modularity function, 
which quantifies the significance of a given community 
partition of a graph as the deviation from the average 
modularity expected in an ensemble of random graphs 
having the same degree sequence [7 . 

Another possibility is parametric bootstrapping, in 
which the significance of a network property is as- 
sessed against small perturbations of the graph connec- 
tivity [TSUlTj . This approach relies on the hypothesis 



that the observed network is representative of a set of 
graphs (a model) having a certain (a priori known) dis- 
tribution of structural characteristics. Consequently, the 
significance of an observed network feature is estimated 
as the deviation from the average of the corresponding 
model. Many different parametric resampling schemes 
have been used to assess the robustness of the modular 
networks against small perturbations of the connectivity. 
However, all these methods require an a priori hypothe- 
sis about the model to which the network belongs, so that 
the unbiased statistical assessment of a network partition 
remains an open challenge |18j . 

A solution which does not require any a priory knowl- 
edge is non-parametric bootstrapping, a computer-based 
technique for providing the statistical confidence of al- 
most any statistical estimate [19j [20]. The key princi- 
ple of non-parametric bootstrapping is to simulate re- 
peated observations from an unknown population using 
the available data samples (in our case, a single network) 
as a starting point. 

In this Brief Report we propose a method to identify 
statistically significant community partitions using non- 
parametric bootstrapping. The method is based on the 
construction of replicates of the transition matrix of the 
network, and in estimating an average distance matrix 
whose elements represent the expected spectral distances 
between pairs of nodes of the graph. Then, the obtained 
distance matrix is fed into a hierarchical clustering algo- 
rithm. By using non-parametric bootstrapping we finally 
construct a community partition for a graph by combin- 
ing the community partitions obtained on the different 
bootstrap replicas of the same graph. This approach 
is in the same line of ensemble or consensus clustering 
methods, which combine several partitions generated by 
different clustering algorithms — or by different runs of 
the same algorithm — into a single statistically signifi- 
cant partition p]51 [T71 I^Tl - Bl] . We analyze the community 
partitions obtained by non-parametric bootstrapping on 
different synthetic and real- world modular networks, and 



we show that this approach can substantially improve 
the performances of existing spectral clustering meth- 
ods. Moreover, we illustrate how non-parametric boot- 
strap can be also used to assess the significance of other 
structural properties, e.g. the spectral distance among 
nodes. 

Bootstrapping random walks on graphs.- Let us con- 
sider a connected undirected and unweighted graph 
G{V,E) with TV = \V\ nodes and K = \E\ edges, associ- 
ated to the binary adjacency matrix A = {aij} {aij = 1 
if there is an edge connecting node i and node j, while 
Oij — otherwise). We define an unbiased random walk 
on the graph G as the time invariant crgodic Markov 
chain whose transition matrix reads P = {Pij}, where 



Oij/ki is the probability for a walker on node i to such that |Ao| > |Ai| > 



Spectral clustering with bootstrap information.- Spec- 
tral clustering is a widely used technique to identify clus- 
ters and communities in a graph. It consists in projecting 
each node i of G into a point Xi of an appropriate met- 
ric space X. The coordinates of Xi in X are given by 
the components associated to node i of the eigenvectors 
of the adjacency matrix A of G or, more frequently, of 
the transition matrix P [51 [H] . The mapping of G on the 
space X allows to cluster the nodes according to their Eu- 
clidean distance in X, so that nodes whose projections in 
X are closer have a higher probability to be put in the 
same cluster. 

The transition matrix P of the random walk on G is 
characterized by a set of eigenvalues {Aq, Ai, . . . , Ajv-i} 
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jump, in one time step, from node i to one of its neigh- 
bours j (here we denote by ki = ^ a^ the degree of 
node i). If we call Q{t) — {qi{t)} the vector of the prob- 
abilities to find a walker on each node i at time i, such 
that '^iQiit) = IVf, the time evolution of the random 
walk is given by the equation: 



Q(i + l) = PTQ(t) 



If the adjacency matrix A is primitive, the Perron- 
Frobenius theorem guarantees that Eq. (IT]) converges 
in finite time to a unique stationary state distribution 
Q* = {g*}, such that Q* = PTQ* ^^. 

The authors of Ref. |17] proposed a generic bootstrap 
scheme to resample the transition probabilities of a fi- 
nite state time-invariant Markov chain. Starting from a 
realization x of the Markov chain, one constructs the 

maximum likelihood estimator of the associated tran- 

f ■ 
sition matrix P as Pij = -r-, where fij the observed 

number of transitions from state i to state j in x ^^'^ 
,fi = ^j fij- Then, replicates of the observed transition 
matrix are obtained by drawing the random variables 
{fill ■■■! fm} ^ Multinomial(/i;pii, ■ ■ • ,PtN) according 
to Pij = -r-. The distribution of P is then obtained by 
Monte-Carlo sampling. This approach was shown to be 
asymptotically valid for approximating the sampling dis- 
tribution of P [27^ , and has been also used to assess the 
confidence intervals of transition probabilities in disease 
modeling [281. 

Since the unbiased random walk on the graph G 
defined by Eq. (IT]) is a finite-state time-invariant 
Markov chain, we can construct a similar resampling 
scheme in which replicates of the transition matrix 
P are obtained by randomly drawing the variables 
{f*^,..., f*pf} from a multinomial distribution with prob- 
abilities {pii, ■ • ■ ,PiAr}, conditional on the observed num- 
ber of edges ki. Each replicate of the original transition 

matrix is estimated as Pij — j^- It is worth noticing 
that, in contrast to previous approaches where each link 
was resampled independently from the others, here the 
replicas of the transition probabilities for each node i are 
drawn from a multinomial distribution, accounting for 
the observed transitions to other nodes {pn, ■ ■ . ,PiN}- 



Xk is associated to the left and right eigenvectors ipk 
and ipk, which satisfy ipiP = Xk^l and Pijjk = Xkipk, 
respectively. The mapping of the matrix P on an Eu- 
clidean space produces a geometric diffusion of points 
such that the distance between the points Xi and Xj , cor- 
responding to the nodes i and j, can be written as [30] : 
Efc>i ^fc(^fc(i) -tpkU))"^ where Vfe(j) denotes the 
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(1) jth component of the fc*'* right eigenvector (notice that 



Vo — Q* and ^Q is a constant vector). This distance 
can be approximated by using only the first f3 nontrivial 
eigenvectors and eigenvalues: 



di^j2^niMi)~Mj)y 



(2) 



n=l 



This mapping transforms the diffusion distance between 
nodes of the graph into the Euclidean distance on the 
space M.^ . The elements {dij} of the matrix D represent 
the distance between each pair of points Xi and Xj in the 
lower dimensional space X = M^. 

Given a transition matrix P associated to an unbi- 
ased random walk on the graph G, we generate B boot- 
strap transition matrices {Pi, P2, . . . , Pb}- We project 
each matrix P]-, into R^ using Eq. (J2|, and we estimate 
the corresponding bootstrap distance matrix Db, whose 
entry d^- is the Euclidean distance between Xi and Xj 
in M." . Then, we compute the average distance matrix 
D* = sEb^b. The matrix D* = {d*j} effectively 
quantifies the dissimilarity between any pair of vertices 
of G (the smaller the distance d* the more similar are 
i and j), in terms of the average distance between their 
projections in R^ across several replicas of P. 

Since D* is a dissimilarity matrix, the optimal parti- 
tion of G in non-overlapping clusters can be obtained by 
using any method based on the iterative aggregation of 
clusters having minimal dissimilarity. In particular, we 
adopted a hierarchical clustering method which starts by 
considering each node as a separate cluster, and succes- 
sively merges the two clusters having minimal distance. 
The distance of two clusters Cp and Cq is computed as 
the average distance between any pair of nodes {i,j) such 
that i G Cp and j € Cq (also known as average linkage, 





0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.2 0.3 0.4 0.5 0.6 0.7 0.2 0.3 0.4 0.5 0.6 0.7 



M 



M 



FIG. 1. Benchmark networks. The variation of information VI as a function of the proportion of inter-modules links /i in GN 
graphs (a) and as a function of the mixing parameter fj, in LFR500 (b) and LFR2000 (c) graphs. The region inside each curve 
includes the 5"* and the 95"* percentiles of VI across R different runs. The four curves in each panel correspond to the optimal 
partitions obtained using the distance matrix D corresponding to P for /? = 1 (light gray) and /3 = 10 (checked pattern), the 
distance matrix D* corresponding to the bootstrapped matrix P* for P — 1 (black), and the modularity optimization on the 
adjacency matrix A, as described in Refs. [71[33 (dark-gray). The network order A'^, the number of runs R, and the number B 
of bootstrap reahzations for each run are (a) N = 128, R = 100, and B = 100; (b) iV = 500, R = 100, and B = 100; and (c) 
N = 2000, R = 50, B = 50. 



see Sect. 4.2 in Ref.[T^ for details). The algorithm stops 
when all the nodes have been grouped in a single cluster. 
The hierarchical clustering algorithm produces a dendro- 
gram H, i.e. a tree where each of the iV — 1 internal 
nodes represents the fusion of two clusters. A horizontal 
cut of H corresponds to a partition of the graph into a 
certain number of communities. The quality of each par- 
tition S was estimated using the modularity function [^ , 
which compares the abundance of edges lying inside each 
community with respect to a null model. In formula: 
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(3) 



where Ng is the number of clusters in the partition S, K 
is the total number of edges in the network, rus is the 
number of edges between vertices in cluster s, and fcs is 
the sum of the degrees of the nodes in cluster s. The 
optimal partition in communities of the graph G is the 
cut of the dendrogram H having maximum modularity. 
We notice that the choice of modularity to quantify the 
relevance of a partition is arbitrary, and that the method 
proposed here can be employed also with other partition 
quality functions [TU]. 

Benchmark tests.- The performance of om approach 
has been tested on two classes of benchmark graphs 
with tunable modular structure. In the first benchmark 
(GN), proposed by Girvan and Newman ^, each net- 
work consists of A^ = 128 nodes divided into 4 modules 
of equal size. Pairs of nodes in the same module are 
connected with probability pi„, while nodes belonging 
to different modules are linked with a probability Pout- 
Parameters are set such that the mean degree is kept 
constant {k) = 16. By appropriately tuning pi^ and 



Pout one can tune the percentage ^ of edges lying be- 
tween communities. The second class of modular graphs 
(LFR), proposed by Lancichinetti, Fortunato and Radic- 
chi [31], accounts for the heterogeneity in the distribu- 
tions of node degrees and community sizes. In this case, 
we generated modular graphs with scale-free degree dis- 
tribution P{k) ~ k~'^ and community size distribution 
P{s) ~ s"'', where 7 = 2 and 77 = 1. An appropriate 
tuning of the model parameters allows to create graphs 
with a prescribed fraction ^ of inter-community edges. 
We considered graphs having N = 500 and (fc) = 7 
(LFR500) and graphs with A^ = 2000 nodes and {k) = 28 
(LFR2000) . The average degree of each graph was tuned 
to maintain the global edge density. 

To compare the optimal partition produced by our 
bootstrap-based algorithm with the reference partition, 
we make use of an information theoretic measure: the 
variation of information (VI) [35]. In a nutshell, this 
non-negative metric quantifies how much information 
is lost and gained in changing from a partition A to 
a partition B. It can be estimated by V{A, B) = 
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, where c^ 



(c ) is the total number of clusters in the partition A (B), 
nf (nf) is the number of nodes in the i^^ (j"^) cluster of 
partition A (B), and n^j^ is the number of nodes shared 
by the i^^ cluster of partition A and the j*"^ cluster of 
partition B. Values of V range from 0, when A and B 
are identical partitions, to log A^ when the two partitions 
are randomly drawn. 

Fig. IT] shows the variation of information as a function 
of the fraction of inter-community edges fi. The reported 
results suggest that even when the graphs do not have 
any more a strong community structure, i.e. when each 




FIG. 2. (a) The best partition of the Zachary Karate 

club network obtained with non-parametric bootstrap for 
a. Q = 0.389 (each module is represented by a different 
color/symbol); (b) bootstrap distribution for the distance be- 
tween nodes i — 3 and j — 14: the small vertical bars indicate 
the value of the observed real network (gray) and the 50"^ per- 
centile of the distribution (black), respectively; (c) confidence 
intervals for the spectral distance between node i ~ 1, (top) 
i — 3 (middle), i — 34 (bottom) and all the other nodes rest of 
the nodes. Gray regions indicate the 0.05"'-95*'' percentiles 
interval of the bootstrap distribution. Notice that node 1 is 
much closer to the the nodes in its community (white circles) 
than to the other nodes. Similarly, node 34 is very close to the 
other nodes in its community (black circles), but quite distant 
from the rest of the graph. Finally, the distance between node 
3 and any node in the black community is comparable to that 
between node 3 and the white communities. 



node has more neighbours outside its community than 
inside it, the accuracy of the proposed bootstrap-based 
method remains pretty high. For GN networks, the accu- 
racy of the bootstrap-based method is comparable to that 
of a standard modularity optimization algorithm [7j |33] . 
For LFR500 and LFR2000, the non-parametric bootstrap 
method outperforms the other algorithms, even when we 
consider an embedding with /3 = 1, and exhibits a smaller 
value of VI up to relatively large values of ^. 

Testing on real networks.- The social network of friend- 
ships between members of a karate club at a US univer- 
sity, studied by the anthropologist Wayne Zachary in the 
1970s [M|, is a paradigmatic example of network with a 
strong modular structure. This network (N = 34 nodes 
and K = 78 edges) has been considered as a benchmark 
reference for most of the community detection algorithms 
proposed in the last decade. Fig.l2{a) shows the commu- 
nities detected by the bootstrap-based algorithm in the 
Zachary's network. It is worth noticing that the proposed 
method detects the two main modules (black and white 
circles, respectively) and an interface community between 
them which contains also nodes 3, 9 and 10, three nodes 
whose belonging to each of the two main modules has 



been considered unstable [TS] and which have been am- 
biguously classified by different algorithms [35l [36] . 

Variability of node attributes.- Bootstrap replicates 
can provide useful insight about the statistical signifi- 
cance of any structural property of a graph with un- 
known and unobservable distribution. For example, in 
Fig.[2Fb) we report the bootstrap distribution of the spec- 
tral distance between two nodes of the Karate Club net- 
work, namely, node 3 and node 14, together with the 
mean value (vertical gray line) and the median (vertical 
black line). Similarly, it is possible to compute the boot- 
strap confidence intervals (c.i.) of a given parameter 9 

as 6 e dbij),St{l - f ) , where 9b{oi) is the 100a per- 
centile of the bootstrap distribution of 6. For instance, 
if we consider a node i and compute the spectral dis- 
tance between i and all the other nodes in the network, 
we can associate a confidence interval to each of these 
distances. In Fig. W[c) we report the 90% confidence in- 
tervals (shaded gray) for the distance in the embedded 
Euclidean space M^ (here /? = 1, but qualitatively similar 
results are obtained for (3 — 5 and (3 = 10) between node 
i — 1 (top), i = 3 (middle) and z = 34 (bottom) and the 
rest of the network. The solid black line represents the 
average over the set of replicates. Notice that node 1 and 
node 34 exhibit a remarkable lower distance to the other 
nodes of the community to which they respectively be- 
long. Conversely, node 3 has a similar spectral distance 
to both the black and the white communities. This is 
probably the reason why this node is usually misclassi- 
fied by most community detection algorithms. 

Concluding remarks.- The present study combines 
probability theory and network science to assess the sta- 
tistical significance of structural network properties by 
means of a non-parametric resampling scheme. We pro- 
pose a method to detect communities based on the gener- 
ation of bootstrap replicates of the transition matrix as- 
sociated with an unbiased random walk on the graph, and 
we show that this method outperforms other community 
detection algorithms based on the spectral properties of 
the graph. Our results suggest that non-parametric boot- 
strapping of random walks can be of practical relevance 
for the study of the meso-scale structure of complex net- 
works. The non-parametric bootstrapping method de- 
scribed here could be also valuable to assess the statisti- 
cal significance of nodes attributes on the basis of other 
random walk parameters (e.g. hitting times), or for the 
statistical validation of other structural properties de- 
fined on different flavors of random walks [371 I3H] • 
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